Supervised or unsupervised & model types¶
Peer Herholz (he/him)
Postdoctoral researcher - NeuroDataScience lab at MNI/McGill, UNIQUE
Member - BIDS, ReproNim, Brainhack, Neuromod, OHBM SEA-SIG
@peerherholz
Aim(s) of this section¶
learn about the distinction between supervised & unsupervised machine learning
get to know the variety of potential models within each
Outline for this section¶
supervised vs. unsupervised learning
supervised learning examples
unsupervised learning examples
A brief recap & first overview¶
let’s bring back our rough analysis outline that we introduced in the previous section

so far we talked about how a model (M) can be utilized to obtain information (output) from a certain input
the information requested can be manifold but can roughly be situated on two broad levels:
learning problem
supervised or unsupervised
specific task type
predicting clinical measures, behavior, demographics, other properties
segmentation
discover hidden structures
etc.
https://scikit-learn.org/stable/_static/ml_map.png
Learning problems - supervised vs. unsupervised¶

if we now also include the task type, we can basically describe things via a 2 x 2 design:

Our example dataset¶
Now that we’ve gone through a huge set of definitions and road maps, let’s move away from these rather abstract discussions to the “real deal”, i.e. seeing how these models behave in the wild. For this we’re going to sing the song “Hello example dataset my old friend, I’ve come to apply machine learning to you again.”. Just to be sure: we will again use the example dataset we briefly explored in the previous section to showcase how the models we just talked about can be put into action, as well as how they change/affect the questions we can address and how we have to interpret the results.
At first, we’re going to load our input data, i.e. X again:
import numpy as np
data = np.load('MAIN2019_BASC064_subsamp_features.npz')['a']
data.shape
(155, 2016)
just as a reminder: what we have in X here is a vectorized connectivity matrix containing 2016 features, which constitute the correlations between brain-region-specific time courses for each of our 155 samples (participants)
as before, we can visualize our X to inspect it and maybe get a first idea of whether there might be something going on
import plotly.express as px
from IPython.core.display import display, HTML
from plotly.offline import init_notebook_mode, plot
fig = px.imshow(data, labels=dict(x="features", y="participants"), height=800, aspect='None')
fig.update(layout_coloraxis_showscale=False)
init_notebook_mode(connected=True)
#fig.show()
plot(fig, filename = 'input_data.html')
display(HTML('input_data.html'))
at this point we already need to decide on our learning problem:
do we want to utilize the information we already have (labels) and thus conduct a supervised learning analysis to predict Y
or do we not want to utilize that information and thus conduct an unsupervised learning analysis, e.g. to find clusters or decompose the data
please note: we only do this for the sake of this workshop! Please never use this type of “Hm, maybe we do this or this, let’s see how it goes.” approach in your research. Always make sure you have a precise analysis plan that is informed by prior research and guided by the possibilities of your data. Otherwise you’ll just add to the ongoing reproducibility and credibility crisis, not accelerating but hindering scientific progress. (The other option is that you conduct exploratory analyses and are simply honest about it, not acting as if they were confirmatory.)
that being said: we’re going to basically test all of them (talk about not practising what one preaches, eh?), again, solely for teaching purposes
within a given learning problem, we will go through a couple of the most heavily used estimators/algorithms and give a little bit of information about each:
supervised learning: SVM, regression, nearest neighbor, tree-ensembles
unsupervised learning: PCA, kmeans, hierarchical clustering
we’re going to start with supervised learning, thus using the information we already have
Supervised learning¶
independent of the precise task type we want to run, we initially need to load the information, i.e. the labels, available to us:
import pandas as pd
information = pd.read_csv('participants.csv')
information.head(n=5)
| participant_id | Age | AgeGroup | Child_Adult | Gender | Handedness | |
|---|---|---|---|---|---|---|
| 0 | sub-pixar123 | 27.06 | Adult | adult | F | R |
| 1 | sub-pixar124 | 33.44 | Adult | adult | M | R |
| 2 | sub-pixar125 | 31.00 | Adult | adult | M | R |
| 3 | sub-pixar126 | 19.00 | Adult | adult | F | R |
| 4 | sub-pixar127 | 23.00 | Adult | adult | F | R |
as you can see, we have multiple variables, i.e. labels, describing our participants (samples), and almost every one of them can be used to address a supervised learning problem (e.g. Child_Adult)
goal: Learn parameters (or weights) of a model (M) that maps X to y
however, while some are categorical and thus could be employed within a classification analysis, others are continuous and thus would fit within a regression analysis (e.g. Age)
we’re going to check both
Supervised learning - classification¶
goal: Learn parameters (or weights) of a model (M) that maps X to y
in order to run a classification analysis, we need to obtain the correct categorical labels, defining them as our Y:
Y_cat = information['Child_Adult']
Y_cat.describe()
count 155
unique 2
top child
freq 122
Name: Child_Adult, dtype: object
we can see that we have two unique values, but let’s plot the distribution just to be sure and maybe see something important/interesting:
fig = px.histogram(Y_cat, marginal='box', template='plotly_white')
fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)
#fig.show()
plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
that looked about right and we can continue with our analysis
to keep things easy, we will use the same pipeline we employed in the previous section, that is we will scale our input data, train a Support Vector Machine and test its predictive performance:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
pipe = make_pipeline(
    StandardScaler(),
    SVC()
)
A bit of information about Support Vector Machines:
non-probabilistic binary classifier
samples are in one of two classes
utilization of hyperplanes as decision boundaries
a hyperplane has n - 1 dimensions for n feature dimensions
support vectors
small vs. large margins
Pros
effective in high dimensional spaces
still effective in cases where the number of dimensions is greater than the number of samples
uses a subset of training points in the decision function (called support vectors), so it is also memory efficient
versatile: different Kernel functions
Cons
if number of features is much greater than the number of samples: danger of over-fitting
make sure to check kernel and regularization
SVMs do not directly provide probability estimates
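The “support vectors” bullet can be made concrete by inspecting a fitted SVC: typically only a subset of the training points ends up defining the decision boundary. A minimal sketch on synthetic data (a hypothetical stand-in, not our connectivity matrix):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic two-class data standing in for X and Y
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = SVC(kernel='linear')
clf.fit(X, y)

# the decision function is defined by the support vectors alone
print('support vectors per class:', clf.n_support_)
print('%s of %s samples are support vectors' % (clf.support_vectors_.shape[0], X.shape[0]))
```

The fewer (and closer to the boundary) the support vectors, the larger the margin the SVC found.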
before we can go further, we need to divide our input data X into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)
and can already fit our analysis pipeline:
pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])
followed by testing the model’s predictive performance:
print('accuracy is %s with chance level being %s' %(accuracy_score(pipe.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy is 0.8974358974358975 with chance level being 0.5
(spoiler alert: can this be right?)
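One reason to be skeptical: the reported chance level of 0.5 assumes balanced classes, but 122 of our 155 participants are children, so always guessing “child” already yields roughly 0.79 accuracy. A quick sketch of such a majority-class baseline, using synthetic labels that mirror the 122/33 split (hypothetical stand-ins, since the real labels live in the notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# synthetic labels mirroring the 122 child / 33 adult split of our dataset
y = np.array(['child'] * 122 + ['adult'] * 33)
X = np.random.rand(155, 5)  # placeholder features; the dummy classifier ignores them

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# always predict the most frequent class seen during training
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('majority-class baseline accuracy: %.3f' % dummy.score(X_test, y_test))
```

So an accuracy of ~0.9 is better than naive guessing, but by much less than the stated 0.5 chance level suggests.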
Supervised learning - regression¶
after seeing that we can obtain a super high accuracy using a classification approach, we’re hooked and want to check if we can get even better performance by addressing our learning problem via a regression approach
for this to work, we need to change our labels, i.e. Y, from a categorical to a continuous variable:
information.head(n=5)
| participant_id | Age | AgeGroup | Child_Adult | Gender | Handedness | |
|---|---|---|---|---|---|---|
| 0 | sub-pixar123 | 27.06 | Adult | adult | F | R |
| 1 | sub-pixar124 | 33.44 | Adult | adult | M | R |
| 2 | sub-pixar125 | 31.00 | Adult | adult | M | R |
| 3 | sub-pixar126 | 19.00 | Adult | adult | F | R |
| 4 | sub-pixar127 | 23.00 | Adult | adult | F | R |
here Age seems like a good fit:
Y_con = information['Age']
Y_con.describe()
count 155.000000
mean 10.555189
std 8.071957
min 3.518138
25% 5.300000
50% 7.680000
75% 10.975000
max 39.000000
Name: Age, dtype: float64
however, we are of course going to plot it again (reminder: always check your data):
fig = px.histogram(Y_con, marginal='box', template='plotly_white')
fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)
#fig.show()
plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
the only thing we need to do to change our previous analysis pipeline from a classification to a regression task is to adapt the estimator accordingly:
from sklearn.linear_model import LinearRegression
pipe = make_pipeline(
    StandardScaler(),
    LinearRegression()
)
A bit of information about regression
modelling the relationship between a scalar response and one or more explanatory variables
Pros
simple implementation, efficient & fast
good performance on linearly separable datasets
can address overfitting via regularization
Cons
prone to underfitting
outlier sensitivity
assumption of independence
the rest of the workflow is almost identical to the classification approach
after splitting the data into train and test sets:
X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)
we fit the pipeline:
pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('linearregression', LinearRegression())])
whose predictive performance can then be evaluated:
from sklearn.metrics import mean_absolute_error
print('mean absolute error in years: %s against a data distribution from %s to %s years' %(mean_absolute_error(pipe.predict(X_test), y_test), Y_con.min(), Y_con.max()))
mean absolute error in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years
Question: Is this good or bad?
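One way to put the ~4.1-year error into perspective is a baseline that always predicts the mean training age. A sketch on synthetic, right-skewed ages (hypothetical stand-ins roughly mimicking our 3.5–39 year range, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# hypothetical right-skewed ages, roughly mimicking the 3.5-39 year range
y = rng.gamma(shape=2.0, scale=5.0, size=155) + 3.5
X = rng.random((155, 5))  # placeholder features; the dummy regressor ignores them

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# always predict the mean age of the training set
dummy = DummyRegressor(strategy='mean')
dummy.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, dummy.predict(X_test))
print('baseline MAE in years: %.2f' % baseline_mae)
```

If our model’s MAE is not clearly below such a mean-predictor baseline on the real data, it has learned little beyond the age distribution itself.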
Having taken a look at classification and regression via their respective common models, we will devote some time to two other prominent models that can be applied within both tasks. (For the sake of completeness, please note that SVMs can also be utilized within regression tasks, changing from a support vector classifier to a support vector regressor.)
Supervised learning - nearest neighbor¶
non-parametric method
distribution-free or
specific distribution + unspec. parameters
classification and regression
class or object property value
k-nearest neighbors
sensitive to local structure of data
Pros
intuitive and simple
no assumptions
one hyperparameter
variety of distance parameters
Cons
slow and sensitive to outliers
curse of dimensionality
requires homogeneous features and works best with balanced classes
how to determine k
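The “how to determine k” point can be probed empirically: evaluate a few values of k on held-out data and compare. A sketch on synthetic data (make_classification as a hypothetical stand-in for our X and Y_cat):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for our X and Y_cat
X, y = make_classification(n_samples=155, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in [1, 3, 5, 7, 9]:
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipe.fit(X_train, y_train)
    scores[k] = pipe.score(X_test, y_test)
    print('k=%s: accuracy %.3f' % (k, scores[k]))
```

In real analyses this selection should happen inside a cross-validation loop (e.g. GridSearchCV), not on the final test set.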
as before, changing our pipeline to use k-nearest neighbors (knn) as the estimator is very easy
we just need to import the respective class and put it into our pipeline:
from sklearn.neighbors import KNeighborsClassifier
pipe = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)
given we can tackle both classification and regression tasks, we will actually do both and compare the outcomes to the results we got before using different estimators
let’s start with classification, for which we need our categorical labels:
Y_cat.describe()
count 155
unique 2
top child
freq 122
Name: Child_Adult, dtype: object
by now you know the rest: we divide into train and test sets, followed by fitting our analysis pipeline and then testing its predictive performance
to ease the comparison with the SVM, we will pack things into a small for-loop, iterating over the two different pipelines
X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)
pipe_knn = make_pipeline(
StandardScaler(),
KNeighborsClassifier(n_neighbors=3))
pipe_svc = make_pipeline(
StandardScaler(),
SVC())
for pipeline, name in zip([pipe_svc, pipe_knn], ['SVC', 'kNN']):
    pipeline.fit(X_train, y_train)
    print('accuracy for %s is %s with chance level being %s'
          %(name, accuracy_score(pipeline.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy for SVC is 0.8974358974358975 with chance level being 0.5
accuracy for kNN is 0.8717948717948718 with chance level being 0.5
how about the regression task?
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)
pipe_knn = make_pipeline(
StandardScaler(),
KNeighborsRegressor(n_neighbors=3))
pipe_reg = make_pipeline(
StandardScaler(),
LinearRegression())
for pipeline, name in zip([pipe_reg, pipe_knn], ['Reg', 'kNN']):
    pipeline.fit(X_train, y_train)
    print('mean absolute error for %s in years: %s against a data distribution from %s to %s years'
          %(name, mean_absolute_error(pipeline.predict(X_test), y_test), Y_con.min(), Y_con.max()))
mean absolute error for Reg in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years
mean absolute error for kNN in years: 4.0175096672735044 against a data distribution from 3.518138261 to 39.0 years
Question for both tasks: which estimator do you choose and why?
https://c.tenor.com/yGhUqB860GgAAAAC/worriedface.gif
Last but not least, another very popular model: tree-ensembles
Supervised learning - tree-ensembles¶
e.g. Random forest
construction of multiple decision trees
classification and regression
class selected by most trees or
mean/average prediction
utilization of entire dataset or subsets thereof
bagging or bootstrapping
Pros
reduces overfitting in decision trees
tends to improve accuracy
addresses missing values
scaling of input not required
Cons
expensive regarding computational resources and training time
reduced interpretability
small changes in the data can lead to drastic changes in the trees
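The reduced interpretability is partly offset by scikit-learn’s feature_importances_ attribute, which scores how much each feature contributed to the trees’ splits. A sketch on synthetic data with a few genuinely informative features (a hypothetical stand-in, not our connectivity matrix):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data with a handful of genuinely informative features
X, y = make_classification(n_samples=155, n_features=20, n_informative=3, random_state=0)

rfc = RandomForestClassifier(random_state=0)
rfc.fit(X, y)

# importances are normalized to sum to 1; higher = used more often for splits
importances = rfc.feature_importances_
top = np.argsort(importances)[::-1][:3]
print('top 3 features:', top, 'with importances:', importances[top])
```

On our real data, the top-ranked features would point to the connections most useful for separating children from adults.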
now that we’ve heard about it, we’re going to put it to work
comparable to the nearest neighbors model, we’ll check it out for both classification and regression tasks
we will also compare it to the other models
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
at first, within a classification task:
X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)
pipe_rfc = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=0))
pipe_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=3))
pipe_svc = make_pipeline(
    StandardScaler(),
    SVC())
for pipeline, name in zip([pipe_svc, pipe_knn, pipe_rfc], ['SVC', 'kNN', 'RFC']):
    pipeline.fit(X_train, y_train)
    print('accuracy for %s is %s with chance level being %s'
          %(name, accuracy_score(pipeline.predict(X_test), y_test), 1/len(pd.unique(Y_cat))))
accuracy for SVC is 0.8974358974358975 with chance level being 0.5
accuracy for kNN is 0.8717948717948718 with chance level being 0.5
accuracy for RFC is 0.9487179487179487 with chance level being 0.5
Oooooh damn, it gets better and better: we nearly got a perfect accuracy score. I can already see our Nature publication being accepted…
https://c.tenor.com/wyaFBOMEuskAAAAC/curious-monkey.gif
Maybe it does comparably well within the regression task? Only one way to find out…
X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)
pipe_rfc = make_pipeline(
StandardScaler(),
RandomForestRegressor(random_state=0))
pipe_knn = make_pipeline(
StandardScaler(),
KNeighborsRegressor(n_neighbors=3))
pipe_reg = make_pipeline(
StandardScaler(),
LinearRegression())
for pipeline, name in zip([pipe_reg, pipe_knn, pipe_rfc], ['Reg', 'kNN', 'RFC']):
    pipeline.fit(X_train, y_train)
    print('mean absolute error for %s in years: %s against a data distribution from %s to %s years'
          %(name, mean_absolute_error(pipeline.predict(X_test), y_test), Y_con.min(), Y_con.max()))
mean absolute error for Reg in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years
mean absolute error for kNN in years: 4.0175096672735044 against a data distribution from 3.518138261 to 39.0 years
mean absolute error for RFC in years: 3.446379857440512 against a data distribution from 3.518138261 to 39.0 years
Won’t you look at that? We got half a year better…nice!
However, what do you think about it?
Now that we’ve spent a fair amount of time evaluating how we can use the information we already have (labels) to predict a given outcome (Y), we will have a look at the things we can learn from the data (X) without using labels.
Unsupervised learning¶
goal: extract information about X
as mentioned before, within unsupervised learning problems, we have two task types:
decomposition & dimensionality reduction: PCA, ICA
clustering: kmeans, hierarchical clustering
comparable to the supervised learning section, we will try each and check what hidden treasures we might discover in our dataset (X)
Unsupervised learning - decomposition & dimensionality reduction¶
goal: extract information about X
Unsupervised learning - PCA¶
compute principal components of the data & change its basis
eigenvectors of covariance matrix obtained via SVD
low dimensional representation of data
variance preservation
directions on orthonormal basis
data dimensions linearly uncorrelated
Pros
remove correlated features
improve performance
reduce overfitting
Cons
less interpretable
scaling required
some information lost
Excited about the PCs of our X? We too!
In general the analysis pipeline and setup don’t differ that much between supervised and unsupervised learning. At first we need to import the class(es) we need:
from sklearn.decomposition import PCA
Next, we need to set up our estimator, the PCA, defining how many components we want to compute/obtain. For the sake of simplicity, we will use 2.
pipe_pca = make_pipeline(
StandardScaler(),
PCA(n_components=2))
With that, we can already fit it to our X, saving the output to a new variable, which will be a decomposed/dimensionality reduced version of our input X:
data_pca = pipe_pca.fit_transform(data)
We can now evaluate the components:
data_pca.shape
(155, 2)
Question: What does this represent, i.e. can you explain what the different dimensions are?
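In case you want to check your answer: the first dimension still indexes our 155 samples, while the second now indexes the 2 principal components; explained_variance_ratio_ tells us how much of the total variance each component retains. A sketch on random data of the same shape (a stand-in, so the actual ratio values here carry no meaning):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((155, 2016))  # random stand-in with the same shape as our data

pipe_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipe_pca.fit_transform(X)

# rows are still samples, columns are now principal components
print(X_pca.shape)  # (155, 2)
# fraction of the total variance each component captures
print(pipe_pca[-1].explained_variance_ratio_)
```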
We can also plot our components and factor in our labels again to check if, for example, the two components we obtained distinguish age-related variables we tried to predict in the supervised learning examples:
information.head(n=5)
| participant_id | Age | AgeGroup | Child_Adult | Gender | Handedness | |
|---|---|---|---|---|---|---|
| 0 | sub-pixar123 | 27.06 | Adult | adult | F | R |
| 1 | sub-pixar124 | 33.44 | Adult | adult | M | R |
| 2 | sub-pixar125 | 31.00 | Adult | adult | M | R |
| 3 | sub-pixar126 | 19.00 | Adult | adult | F | R |
| 4 | sub-pixar127 | 23.00 | Adult | adult | F | R |
How about the categorical variable Child_Adult?
labels = {
str(i): f"PC {i+1} ({var:.1f}%)"
for i, var in enumerate(pipe_pca[1].explained_variance_ratio_ * 100)
}
fig = px.scatter_matrix(
data_pca,
labels=labels,
dimensions=range(2),
color=information["Child_Adult"]
)
fig.update_traces(diagonal_visible=False)
# fig.show()
init_notebook_mode(connected=True)
plot(fig, filename = 'pca.html')
display(HTML('pca.html'))
Not a “perfect” fit, but it definitely looks like the PCA was able to compute important components of our data that nicely separate our groups.
We could now work further with our components, e.g. keeping things in the realm of dimensionality reduction and thus using them as X within a supervised learning approach, or further evaluating them to test if they also separate more fine-grained classes in our dataset like AgeGroup or even Age.
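The first of these options, using the components as X for a supervised model, can be wired into a single pipeline. A sketch on synthetic data (make_classification as a hypothetical stand-in for our dataset and binary labels):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for our dataset and binary labels
X, y = make_classification(n_samples=155, n_features=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scale, reduce to 2 components, then classify on the reduced representation
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
pipe.fit(X_train, y_train)
print('accuracy on the 2-component representation: %.3f' % pipe.score(X_test, y_test))
```

Placing the PCA inside the pipeline ensures it is fit on the training set only, avoiding leakage into the test set.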
However, given our unfortunate time constraints, we will continue with the next decomposition/dimensionality reduction approach: ICA.
Unsupervised learning - ICA¶
compute additive subcomponents of data
minimize mutual information, maximize non-Gaussianity
special case of blind source separation
problem of underdetermination & set of possible solutions
preprocessing
whitening, centering, dimensionality reduction
Pros
computationally fast & applicable to many signal types
recovers statistically independent, non-Gaussian sources
Cons
components are unordered; their sign & scale are arbitrary
results can vary between runs/initializations
Alrighty, let’s see how it performs on our dataset!
You guessed right, we need to import it first:
from sklearn.decomposition import FastICA
The rest works as with the PCA: we define our analysis pipeline
pipe_ica = make_pipeline(
StandardScaler(),
FastICA(n_components=2))
and fit it to our dataset:
data_ica = pipe_ica.fit_transform(data)
Coolio! As with PCA, we obtain two components:
data_ica.shape
(155, 2)
However, this time the components are statistically independent sources (additive subcomponents) rather than orthogonal directions.
Any guesses on how things might look like? We can easily check that out.
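Plotting would follow the same scatter-matrix pattern as for the PCA above. As a numeric sketch of what ICA is actually doing, here it unmixes two independent non-Gaussian sources from a synthetic linear mixture (a classic blind-source-separation toy example, not our real data):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# two independent, non-Gaussian sources
s1 = rng.uniform(-1, 1, 500)
s2 = rng.laplace(size=500)
S = np.c_[s1, s2]
A = np.array([[1.0, 0.5], [0.5, 1.0]])  # mixing matrix
X_mixed = S @ A.T  # observed signals are linear mixtures of the sources

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X_mixed)  # estimated (unmixed) sources
print(S_est.shape)  # (500, 2)
```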
Question: When would you apply PCA and when ICA?
Decomposition & dimensionality reduction is quite fun, isn’t it? Do you think the second set of unsupervised learning tasks, i.e. clustering can beat that? Only one way to find out…
Unsupervised learning - clustering¶
goal: extract information about X
We saw that we can use decomposition and dimensionality reduction approaches to unravel important dimensions of our data X. But can we also discover a certain structure in an unsupervised learning approach? That is, would it be possible to divide our dataset X into groups or clusters? We will employ two approaches: kmeans and hierarchical clustering to find out!
Unsupervised learning - kmeans¶
partition n observations into k clusters
cluster based on nearest mean (center/centroid)
partitioning of the data space into Voronoi cells
minimizes within-cluster variances
Pros
simple, fast & scales well to large datasets
Cons
k must be chosen beforehand
sensitive to initialization & outliers
assumes roughly spherical, similarly sized clusters
Now it’s time to test it on our dataset. After importing the class:
from sklearn.cluster import KMeans
we add it to our pipeline and apply it:
pipe_kmeans = make_pipeline(
StandardScaler(),
KMeans(n_clusters=2))
pipe_kmeans.fit(data)
Pipeline(steps=[('standardscaler', StandardScaler()),
('kmeans', KMeans(n_clusters=2))])
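The fitted pipeline exposes the cluster assignments via labels_; on our real data we could compare them against Child_Adult with a label-permutation-invariant score such as the adjusted Rand index. A sketch on synthetic two-cluster data with known group membership (a stand-in for our X):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic two-cluster data standing in for our X, with known group membership
X, y_true = make_blobs(n_samples=155, centers=2, random_state=0)

pipe_kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
pipe_kmeans.fit(X)
cluster_labels = pipe_kmeans[-1].labels_

# 1.0 = perfect agreement with the true grouping, ~0 = chance level
print('adjusted Rand index: %.2f' % adjusted_rand_score(y_true, cluster_labels))
```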
Unsupervised learning - hierarchical clustering¶
build hierarchy of clusters
agglomerative/bottom up
divisive/top-down
merges & splits
Pros
no need to pre-specify the number of clusters (the dendrogram can be cut at any level)
the hierarchy itself is informative
Cons
computationally expensive for large datasets
greedy merges/splits cannot be undone
Well well well, how will hierarchical clustering perform in our dataset X?
from sklearn.cluster import AgglomerativeClustering
pipe_clust = make_pipeline(
StandardScaler(),
AgglomerativeClustering(n_clusters=2))
pipe_clust.fit(data)
Pipeline(steps=[('standardscaler', StandardScaler()),
('agglomerativeclustering', AgglomerativeClustering())])
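As with kmeans, the fitted AgglomerativeClustering exposes its assignments via labels_, so the discovered grouping can be evaluated the same way. A sketch on synthetic data (again a stand-in for our X, with known group membership):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic two-cluster data with known group membership
X, y_true = make_blobs(n_samples=155, centers=2, random_state=0)

pipe_clust = make_pipeline(StandardScaler(), AgglomerativeClustering(n_clusters=2))
pipe_clust.fit(X)
merge_labels = pipe_clust[-1].labels_

print('adjusted Rand index: %.2f' % adjusted_rand_score(y_true, merge_labels))
```

Comparing this score with the kmeans result indicates which clustering approach better recovers the structure in a given dataset.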